11/26/2019

DRY vs WET

  • Don’t repeat yourself (DRY) is a principle of software development aimed at reducing repetition to avoid redundancy.

  • Write every time (WET) is the violation of DRY.

“Every piece of knowledge must have a single, unambiguous, authoritative representation within a system”

— The Pragmatic Programmer by Andy Hunt and Dave Thomas

Example

df <- data.frame(
  a = c(6, -99, 5, 5, 5),
  b = c(4, 4, 1, 7, 2),
  c = c(7, 3, 7, 4, 6),
  d = c(1, 6, 2, 5, 3),
  e = c(9, -99, 10, 8, 1)
)
df
##     a b c d   e
## 1   6 4 7 1   9
## 2 -99 4 3 6 -99
## 3   5 1 7 2  10
## 4   5 7 4 5   8
## 5   5 2 6 3   1
df$a[df$a == -99] <- NA
df$b[df$b == -99] <- NA
df$c[df$c == -98] <- NA  # copy-paste bug: -98 instead of -99
df$d[df$b == -99] <- NA  # copy-paste bug: checks df$b, not df$d
df$e[df$e == -99] <- NA
— Advanced R by Hadley Wickham

Example

fix_missing <- function(x) {
  x[x == -99] <- NA
  x
}
df[] <- lapply(df, fix_missing)
— Advanced R by Hadley Wickham

Outline

  • Brief revision of machine learning
    • The Three Musketeers: \(k\)-nearest neighbors, logistic regression, neural networks
    • Hyperparameters vs parameters
  • Hands-on caret package

Machine learning

Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed.

— Arthur Samuel, 1959

ML roadmap: types of algorithms

  • Supervised learning: learn the mapping function \(f\) from an input \(X\) to an output \(y\).

    \[y = f(X)\]

    The (labeled) data contains both the input (features \(X\)) and the output (outcome \(y\)).

    • Classification: categorical outcome \(y\)
    • Regression: numerical outcome \(y\)
  • Unsupervised learning: learn the structures/patterns in the data.

  • Reinforcement learning: learn the best strategy/policy.

Selected classification algorithms

\(k\)-nearest neighbors

How many neighbors should we use?

How did you get this number?

Distance and scaling

  • Minkowski distance:

\[D(X_1, X_2) = \left( \sum_{j = 1}^{p} |x_{1, j} - x_{2, j}|^q \right)^\frac{1}{q}, \quad q\geq 1.\]

  • Centering and scaling:

\[\tilde{x}_{i, j} = \frac{x_{i, j} - \mu_j}{\sigma_j}, \quad i = 1, ..., n, \quad j = 1, ..., p. \]

Example: predicting IQ by the average grade (1-6) and the weight of the brain (1100-1600 g).
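To make this concrete, here is a minimal base-R sketch (the grade and weight values below are made up for illustration): without scaling, the brain-weight column dominates the Minkowski distance, and after scale() both features contribute comparably.

```r
# Hypothetical students: average grade (1-6) and brain weight (grams)
X <- cbind(grade  = c(4.5, 5.8, 3.9),
           weight = c(1250, 1420, 1180))

# Minkowski distance with q = 2 (Euclidean); dist() calls the exponent "p"
dist(X, method = "minkowski", p = 2)
# grade differences (~1) are invisible next to weight differences (~200)

# Center and scale: (x - mean) / sd, column by column
Xs <- scale(X)
round(colMeans(Xs), 10)                # each column now has mean 0 ...
apply(Xs, 2, sd)                       # ... and standard deviation 1
dist(Xs, method = "minkowski", p = 2)  # both features now matter
```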

Logistic regression

  • The linear predictor:

\[\eta(X) = \alpha_0 + \alpha_1 x_1 + ... + \alpha_p x_p. \]

  • The probability prediction:

\[\mathbb{P}(Y = 1|X) = \frac{e^{\eta(X)}}{1 + e^{\eta(X)}}.\]

  • The prediction of the class:

\[y(X) = \begin{cases}1, \quad \text{if} \quad \mathbb{P}(Y = 1|X) \geq t, \\0, \quad \text{if} \quad \mathbb{P}(Y = 1|X) < t. \end{cases}\]

How do we get values of coefficients?
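The coefficients are estimated from the data by maximum likelihood, which in R is exactly what glm() with family = binomial does. A small sketch on simulated data (the true coefficient values below are made up for illustration):

```r
set.seed(1)
x <- rnorm(500)
p <- 1 / (1 + exp(-(-0.5 + 2 * x)))      # true alpha_0 = -0.5, alpha_1 = 2
y <- rbinom(500, size = 1, prob = p)

fit <- glm(y ~ x, family = binomial)     # maximum likelihood (via IRLS)
coef(fit)                                # estimates close to (-0.5, 2)

p_hat <- predict(fit, type = "response") # P(Y = 1 | X)
y_hat <- as.integer(p_hat >= 0.5)        # class prediction with t = 0.5
```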

Neural networks

How to choose the number of nodes in the hidden layer? How do we get weights?

Parameters vs hyperparameters

The difference between parameters and hyperparameters.

  • Hyperparameters are the parameters of the learning algorithm itself; they must be set prior to training and remain constant during training. Their values cannot be estimated from the data and must instead be tuned. Examples are the number of neighbors in KNN or the number of hidden layers in a neural network. Plain logistic regression does not have any hyperparameters.

  • Parameters are estimated from the data through the fitting or training process. For example, the weights of a neural network are parameters, as are the coefficients of logistic regression. The KNN algorithm does not have parameters.

Metric: accuracy

                    Predicted Negative    Predicted Positive
Observed Negative   True Negative (TN)    False Positive (FP)
Observed Positive   False Negative (FN)   True Positive (TP)

\[\text{Accuracy} = \frac{\text{TN} + \text{TP}}{\text{TN} + \text{FN} + \text{FP} + \text{TP}}\]
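A minimal sketch of the formula in R, using toy observed and predicted vectors (the values are invented for illustration):

```r
# Toy labels: 0 = negative, 1 = positive
observed  <- c(0, 0, 1, 1, 1, 0, 1, 0)
predicted <- c(0, 1, 1, 1, 0, 0, 1, 0)

cm <- table(Observed = observed, Predicted = predicted)
cm                                     # 2x2 confusion matrix

accuracy <- sum(diag(cm)) / sum(cm)    # (TN + TP) / total
accuracy                               # 6 of 8 correct: 0.75
```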

Train/test split: out-of-sample accuracy

Data resampling: a cure for overfitting

Motivation: data scientist’s nightmare

  • k-Nearest Neighbors (class package, knn function):
    • Input: train, test, cl
    • Output: predicted values
  • Logistic regression (stats package, glm function):
    • Input: formula, data OR x, y
    • Output: an object of class glm
  • Neural Networks (nnet package, nnet function):
    • Input: formula, data OR x, y
    • Output: an object of class nnet

Not all packages have a function that accepts both the x, y and the formula, data interfaces (e.g., xgboost or keras).

Motivation: data scientist’s nightmare

The function predict:

  • k-Nearest Neighbors (class package, knn function):
    • None
  • Logistic regression (glm function):
    • type = c("link", "response", "terms")
  • Neural Networks (nnet package):
    • type = c("raw", "class")

Motivation: data scientist’s nightmare

Hyperparameters to tune:

  • k-Nearest Neighbors (class package, knn function):
    • k
  • Logistic regression (glm function):
    • None
  • Neural Networks (nnet package):
    • size, decay

Motivation: data scientist’s nightmare

In a nutshell, models/algorithms covered by R packages are not consistent and have:

  • their own syntax
  • unique hyperparameters to tune
  • different requirements on the data format
  • etc.

The same model can be implemented in several R packages (e.g., neural networks appear in nnet, neuralnet, h2o, darch, mxnet, deepnet, etc.).

caret

The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process of creating predictive models (supervised machine learning). The package contains tools for:

  • data splitting (e.g., train vs test)
  • pre-processing (e.g., centering and scaling)
  • feature selection
  • model tuning using resampling (e.g., cross-validation)
  • variable importance estimation
  • etc.

The package provides a unified interface to a vast number of models (238!) from 30 packages.

Installing and loading caret

To install caret, simply use the following command (note: you only need to do this once):

install.packages("caret")

To load the package, use:

library("caret")

Remember, caret does not contain all of those models itself; it is rather a wrapper around them. It therefore makes sense to load a package/model only when it is actually used, and that is exactly what caret does. However, caret assumes that the models you use have already been installed.

The Beatles – All You Need Is Love http://tiny.cc/caRet

Data example

The wine data covers vinho verde wine samples from the north of Portugal. The dataset has 4,898 instances and contains 12 columns:

  • 11 numerical columns: physicochemical characteristics of the wines (the features)
  • 1 categorical column with levels "good" and "bad": the quality of the wine (the outcome variable)

Data splitting: createDataPartition()
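A minimal sketch of a stratified train/test split, using a simulated outcome that stands in for the wine data's quality column (the 70/30 class mix is made up for illustration):

```r
library(caret)

set.seed(123)
# illustrative outcome; in the wine data this would be the quality column
quality <- factor(sample(c("good", "bad"), size = 100, replace = TRUE,
                         prob = c(0.7, 0.3)))

# 75% training split, stratified so both parts keep the class proportions
in_train  <- createDataPartition(quality, p = 0.75, list = FALSE)
train_set <- quality[in_train]
test_set  <- quality[-in_train]
table(train_set)
table(test_set)
```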

Fitting models without hyperparameter tuning

train()

The train() function, whose name is self-explanatory, is the holy grail of the caret package. Roughly speaking, it fits a model (i.e., estimates its parameters).

The function has two sets of main arguments: form, data, and method OR x, y, and method. In this tutorial, we work with the form and data interface to avoid confusion.

train()

The function has other important arguments:

  • method: a string specifying which classification or regression model to use
  • preProcess: a string vector that defines a pre-processing of the predictor data (e.g., "center", "scale")
  • metric: a string that specifies which summary metric to use (e.g., "Accuracy", "Kappa", etc.)
  • trControl: a list of values that define how this function acts (see further)
  • tuneGrid: a data frame with possible tuning values (i.e., set of values of hyperparameters, see further)
  • etc.

train()

The object returned by the train() function can be printed and plotted (the latter only if hyperparameters are tuned). Further, this object can be passed to predict.train().

The result of predict.train() can be further passed to confusionMatrix() to get all the metrics.
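The whole pipeline fits in a few lines. A sketch on simulated data standing in for the wine data (variable names are illustrative; method = "glm" is used because it needs no tuning):

```r
library(caret)

set.seed(1)
# illustrative data standing in for the wine data
df <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
df$quality <- factor(ifelse(df$x1 + df$x2 + rnorm(300) > 0, "good", "bad"))

# no hyperparameters to tune for method = "glm"
fit <- train(quality ~ ., data = df, method = "glm",
             preProcess = c("center", "scale"))

pred <- predict(fit, newdata = df)   # dispatches to predict.train()
confusionMatrix(pred, df$quality)    # accuracy, kappa, sensitivity, ...
```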

Models tuning using resampling

In the previous examples, the values of the hyperparameters, k (for knn) and size and decay (for nnet), were arbitrary. Perhaps other values of k, size, and decay can yield a better result. We need to tune those hyperparameters.

In the next example we use the 5-fold cross-validation resampling method to tune models and accuracy to compare models.

trainControl()

caret automates the process of tuning, allowing for many techniques, such as (repeated) cross-validation, bootstrap, etc. To specify details of tuning, the function trainControl() can be used. It has the following arguments:

  • method: the resampling method: "boot", "boot632", "cv", "LOOCV", etc.
  • number: the number of folds or number of resampling iterations
  • etc.
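For example, the 5-fold cross-validation used later in this tutorial can be specified as follows (the repeated variant is shown for comparison):

```r
library(caret)

# 5-fold cross-validation
ctrl <- trainControl(method = "cv", number = 5)

# 5-fold cross-validation repeated 3 times
ctrl_rep <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
```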

tuneGrid argument of train() function

We need to define the possible values of the hyperparameters that will be tested using resampling. The tuneGrid argument supplies a data frame, each row of which is one combination of such hyperparameter values. It is very convenient to use the built-in expand.grid() function to define all possible combinations.

Instead of specifying tuneGrid, one can pass tuneLength: caret will then construct the tuning grid (i.e., the values of the hyperparameters to test) by itself.
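For instance, a grid for nnet can be built with expand.grid() (the candidate size and decay values below are made up for illustration):

```r
# Every combination of candidate size and decay values: 3 x 3 = 9 rows
grid <- expand.grid(size = c(2, 4, 8), decay = c(0, 0.1, 0.5))
nrow(grid)   # 9

# Sketch of the full call (assuming a data frame df with outcome quality):
# train(quality ~ ., data = df, method = "nnet",
#       trControl = trainControl(method = "cv", number = 5),
#       tuneGrid = grid, trace = FALSE)
```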

How do we look up a model and its available tuning hyperparameters?
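caret itself can answer this: modelLookup() reports the tunable hyperparameters for a given method string, and getModelInfo() lists everything the package knows about.

```r
library(caret)

modelLookup("knn")    # one tunable parameter: k
modelLookup("nnet")   # two tunable parameters: size and decay

# names(getModelInfo()) lists every method string caret supports
head(names(getModelInfo()))
```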

Summary

  • caret is an interface that unifies access to different packages/models.

  • The main function of caret is train():
    • method defines the model
    • tuneGrid contains sets of parameters (expand.grid())
    • trControl is responsible for computational nuances (trainControl())
    • preProcess specifies pre-processing of predictors
  • The result of trainControl() is passed to the trControl argument of train():
    • method: the resampling method
    • number: the number of resampling iterations

Thank you!